Abstract We study gaze estimation on tablets; our key
design goal is uncalibrated gaze estimation using the 
front-facing camera during natural use of tablets, where 
the posture and method of holding the tablet is not 
constrained. We collected the first large unconstrained 
gaze dataset of tablet users, labeled Rice TabletGaze 
dataset. The dataset consists of 51 subjects, each with 4 
different postures and 35 gaze locations. Subjects vary 
in race, gender and in their need for prescription glasses, 
all of which might impact gaze estimation accuracy. 
Driven by our observations on the collected data, we 
present the TabletGaze algorithm for automatic gaze estimation
using multi-level HoG features and a Random
Forests regressor. The TabletGaze algorithm achieves
a mean error of 3.17 cm. We perform an extensive evaluation
of the impact of various factors such as dataset
size, race, wearing glasses and user posture on the gaze 
estimation accuracy and make important observations 
about the impact of these factors. 

Keywords Eye • Gaze Estimation/Tracking • Dataset • 
Mobile Device • Applications 

1 Introduction 

Tablets are now a commonplace connected mobile computing
device, and are in use worldwide for diverse

Q. Huang (✉)

ECE Department, Rice University, Houston, TX, USA 
E-mail: qh3@rice.edu 

A. Veeraraghavan 

ECE Department, Rice University, Houston, TX, USA 
E-mail: vashok@rice.edu 

A. Sabharwal 

ECE Department, Rice University, Houston, TX, USA 
E-mail: ashu@rice.edu 


applications. Current user-tablet interactions are mainly 
enabled by touch and sound. However, gaze is an emerg¬ 
ing proxy of the user’s attention and intention [10]. Gaze 
information has the potential to enable a wide array of 
useful applications on tablets, including: i) hands-free 
human device interaction, such as using gaze to control 
the device when certain regions of the screen are hard to 
reach [26]; ii) behavior studies, such as using gaze path
information for understanding and tracking reading
behavior [15]; and iii) user authentication, where gaze-based
features are used as a biometric [20]. In the future, many
other applications could be enabled by gaze tracking on 
tablets. 

In this paper, we study gaze estimation on the cur¬ 
rent generation of tablets, without requiring any ad¬ 
ditional hardware. Nearly all modern tablets include 
front-facing cameras. Our approach will be to leverage 
images from the front-facing cameras for gaze estimation 
and tracking (gaze estimation at frame rate), thereby 
making the resulting system suitable for today’s tablets. 

We adopt an appearance-based gaze estimation approach,
since it does not need the calibration stage
often required by many existing approaches [3, 14, 29].
Appearance-based methods find a regression mapping
from the appearance of eye-region images to the gaze
direction, which is then applied to new unseen eye
images. In this way, a regression model could be trained 
off-line, and then loaded on any tablet, estimating gaze 
using recorded images for any user. 

A key challenge in tablet gaze tracking is the ability 
to robustly handle unconstrained use of tablets. During 
user-tablet interaction, there is often head motion, hand 
movement and change of body posture. As a result, 
shifts in the viewing angle, changes of distance between 
the user and the screen, and variations in illumination 
are possible. Moreover, any useful method should also 
be capable of tolerating variations in the features of the subject
population, such as eye shape, skin and iris color, and whether or not
glasses are worn. To handle these challenges,
the mobile gaze tracking algorithm should be free of
three constraints: i) no constraint on how people use the
tablet; ii) no constraint on what kind of body posture
people have when using the tablet; and iii) no constraint
on the user of the tablet.

While unconstrained gaze estimation is practically 
very useful, there exist no standard datasets to evalu¬ 
ate the reliability and accuracy of gaze estimation algo¬ 
rithms. 

We study the unconstrained mobile gaze estimation 
problem in three steps. First, we collected an uncon¬ 
strained mobile gaze dataset of tablet users from 51 
subjects. We name the dataset Rice TabletGaze dataset. 
To the best of our knowledge, this dataset is the first of
its kind and is released online for the research community
(http://sh.rice.edu/tablet_gaze.html). While
the dataset was collected with one tablet, gaze estimation
models trained on this dataset could be applied to other
handheld devices by learning a mapping between device
specifications such as the camera location on the tablet. The
dataset consists of video sequences that were recorded 
by the tablet front-facing camera while subjects were 
looking at a dot appearing randomly on the tablet screen
at one of the 35 predefined locations. Subjects in the 
dataset are of diverse ethnic backgrounds, and 26 of 
them wear prescription glasses. During the data collec¬ 
tion process, subject motion was not restricted, and 
each subject performed four body postures: standing, 
sitting, slouching, and lying. Due to our protocol design, 
natural and realistic subject appearance variations are 
captured in the dataset. We obtain a subset of our full 
dataset, consisting of around 100,000 images from 41 
subjects. The subset is labeled with ground truth 2D 
gaze locations (x and y coordinates on the tablet screen), 
and used extensively in this paper. 

We also present the TabletGaze algorithm to esti¬ 
mate a user’s gaze given an image recorded by the tablet 
front camera. The appearance-based TabletGaze algo¬ 
rithm is composed of standard computer vision building 
blocks. In the algorithm, the eyes in the image are first 
detected by a cascade eye detector [46], and then a 
tight region around the eyes is cropped. A multi-level 
HoG (mHoG) [21] feature is then extracted from the 
cropped eye images, and Linear Discriminant Analysis 
(LDA) is applied subsequently to reduce the feature 
dimensionality. The final feature is fed into a Random 
Forests (RF) [2] regressor, which outputs the location 
on the tablet screen at which the person in the image is 
gazing. The optimal combination of eye-region feature
(mHoG) and regression model (RF) is found through a
performance comparison of 5 different features and 4
regressors on the Rice TabletGaze dataset.

Fig. 1 In this work, we provide the first dataset and an
automatic algorithm for unconstrained gaze estimation on
tablets. The mean error we obtained is indicated in the figure.
A variety of useful applications can be enabled through gaze
estimation on mobile devices.

Then we
evaluate the algorithm’s performance through extensive 
experiments on the Rice TabletGaze dataset. The al¬ 
gorithm is evaluated on both person-independent and 
person-dependent training scenarios. We also extensively 
evaluate and analyze the impact of factors that could 
affect gaze estimation accuracy, including dataset size, 
race, prescription glasses and user posture. Lastly, we 
applied the algorithm to videos in the dataset to show 
continuous tracking results and demonstrated that the 
error variance can be reduced by using a bilateral filter. 
An overview of the gaze estimation system setup, the 
average result, and applications of gaze estimation are 
shown in Fig. 1. 
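The smoothing step mentioned above can be illustrated with a minimal sketch of a bilateral filter applied to a sequence of per-frame 2D gaze estimates. The function name, the Gaussian form of both weights, and all parameter values are assumptions made here for illustration; the text does not specify them.

```python
import math


def bilateral_filter_gaze(points, sigma_t=2.0, sigma_d=2.0, radius=4):
    """Smooth a sequence of (x, y) gaze estimates given in cm.

    sigma_t: temporal spread in frames (assumed value).
    sigma_d: spatial spread in cm (assumed value).
    radius:  neighbouring frames considered on each side (assumed value).
    """
    smoothed = []
    for i, (xi, yi) in enumerate(points):
        lo = max(0, i - radius)
        hi = min(len(points), i + radius + 1)
        wsum = xsum = ysum = 0.0
        for j in range(lo, hi):
            xj, yj = points[j]
            # Temporal weight: nearby frames count more.
            w_time = math.exp(-((i - j) ** 2) / (2 * sigma_t ** 2))
            # Range weight: estimates far from the current one count less,
            # so genuine jumps between gaze targets are not blurred away.
            dist2 = (xi - xj) ** 2 + (yi - yj) ** 2
            w_range = math.exp(-dist2 / (2 * sigma_d ** 2))
            w = w_time * w_range
            wsum += w
            xsum += w * xj
            ysum += w * yj
        # wsum >= 1 because the frame itself always has weight 1.
        smoothed.append((xsum / wsum, ysum / wsum))
    return smoothed
```

Because the range weight suppresses neighbours that lie far from the current estimate, small frame-to-frame jitter is averaged out while a large move to a new gaze target is largely preserved.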

In summary, this paper makes three key contribu¬ 
tions: 

i) Rice TabletGaze Dataset: a large gaze dataset
was collected in an unconstrained mobile environment,
capturing natural and realistic subject appearance variations.
This dataset is publicly available at
http://sh.rice.edu/tablet_gaze.html for research purposes.

ii) TabletGaze Algorithm: an automatic gaze estimation
algorithm is presented that achieves a mean
error (ME) of 3.17 cm on the tablet screen, which is
a significant improvement over prior art.

iii) Analysis: The study of the impact of training 
data size shows that the estimation accuracy can be 
further increased by collecting more data. We also show 
that for a large training dataset, dividing the dataset 
based on racial characteristics and body postures could 
improve the overall accuracy. However, partitioning the 
dataset based on whether or not the subject is wearing 
eyeglasses does not change the algorithm’s performance. 
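To make the mean error (ME) figure in contribution ii) concrete, the sketch below computes an on-screen error in centimeters. It assumes that ME is the average Euclidean distance between estimated and ground-truth 2D gaze locations; this section does not define the metric explicitly, and the function name is hypothetical.

```python
import math


def mean_error_cm(predicted, actual):
    """Average Euclidean distance (cm) between predicted and true gaze points.

    Both arguments are equal-length sequences of (x, y) screen coordinates
    in cm. Treating ME as a mean Euclidean distance is an assumption.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for (px, py), (ax, ay) in zip(predicted, actual):
        total += math.hypot(px - ax, py - ay)
    return total / len(predicted)
```

For example, one exact prediction and one prediction that is off by 3 cm horizontally and 4 cm vertically give an error of (0 + 5) / 2 = 2.5 cm.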
2 Related Works

We focus on estimating the 2D location on the tablet 
screen where the user’s eyes are focused instead of 3D 
gaze direction in space. A detailed summary of gaze 
direction estimation can be found in the following review 
paper [11]. 

2.1 Point of Gaze Estimation for Stationary Displays 

Gaze estimation methods are typically categorized into 
two main groups: geometry-based, and appearance-based 
[11]. Geometry-based methods rely on the tracking of 
certain eye features, such as the iris [14], pupil center 
[23, 31], or Purkinje images [12]. To robustly track the 
features, those methods require extra infra-red illumina¬ 
tion source(s), multiple cameras with calibration, and 
sometimes session-dependent personal calibration. 

2.1.1 Geometry-based Methods 

Geometry-based methods utilize explicit 3D eye ball 
models along with the tracked eye features to estimate 
the 3D gaze direction. The point of gaze is then found 
through the intersection of gaze direction and the screen. 
Based on the pupil center and Purkinje image from one 
camera and an infrared LED array, a double ellipse fit¬ 
ting mechanism was proposed in [29] to predict the gaze. 
However, the system required a fixed distance between 
the display and the user, and head motion was limited 
to a 4-cm-square area. Meanwhile, an approach free of 
user calibration was presented in [35]. Two cameras and 
two point light sources that were calibrated and not co- 
linear were used to find the 3D locations of the cornea 
and pupil centers. The gaze direction was computed by 
connecting the cornea center and pupil center. Another 
approach, proposed in [41], used a single image of one 
eye to estimate gaze direction. The iris contour in the 
image was modeled using an ellipse. The ellipse was 
then back projected into an iris circle, whose normal 
was regarded as the gaze direction. 
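The step of intersecting the gaze direction with the screen can be sketched as a ray-plane intersection. The coordinate frame (screen in the plane z = 0, units in cm), the function name and the tolerance are assumptions for illustration and are not taken from any of the cited methods.

```python
def point_of_gaze(eye, direction,
                  plane_point=(0.0, 0.0, 0.0),
                  plane_normal=(0.0, 0.0, 1.0)):
    """Intersect a gaze ray with the screen plane.

    eye:       3D eye position (cm) in an assumed screen-centred frame.
    direction: 3D gaze direction (need not be unit length).
    Returns the 3D intersection point, or None if the ray is parallel to
    the screen or points away from it.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    denom = dot(direction, plane_normal)
    if abs(denom) < 1e-12:
        return None  # gaze ray is parallel to the screen plane
    offset = [p - e for p, e in zip(plane_point, eye)]
    t = dot(offset, plane_normal) / denom
    if t < 0:
        return None  # screen lies behind the viewer
    return tuple(e + t * d for e, d in zip(eye, direction))
```

With the eye 30 cm in front of the screen center and a direction of (0.1, 0, -1), the ray meets the screen 3 cm to the side of the center.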

2.1.2 Appearance-based Methods

Appearance-based methods [1, 19] treat the eye region 
image or features extracted from the eye region image 
as a high dimensional vector, and learn a regression 
mapping model from such vector to the point of gaze 
(or gaze direction) through labeled training data. Such 
methods have the potential to be non-intrusive, free of 
calibration and can operate free of external hardware. 
A variety of regression models were utilized to find the 
mapping from the eye appearance to point of gaze (or 


gaze direction) in different works. In [38], eye images 
were modeled as an appearance manifold. The gaze di¬ 
rection of a new sample was obtained from a linear 
interpolation of neighboring samples in the manifold 
model. This method was evaluated only on three sub¬ 
jects with fixed head pose. It used leave-one-image-out 
cross validation so a test subject’s data appeared in the 
training phase. On the other hand, in [43], a sparse, 
semi-supervised Gaussian Process Regression model was 
applied to deal with partially labeled gaze data, and real¬ 
ized real-time prediction of gaze direction. The method 
was evaluated using test images corresponding to unseen 
gaze locations in the training images. However, there 
was no description on whether a subject’s data appear 
both in the training and testing process. In addition, 
there was also no description on whether the data was 
collected from subjects with a fixed head pose. 

In some works [18, 37], 3D head pose information
is extracted from images to compensate for head mo¬ 
tion and improve gaze estimation accuracy. A two step 
scheme was introduced in [18] to estimate gaze direction 
under free head motion. The method first estimated an 
initial gaze direction from computed eye features under 
a fixed head pose, then corrected the gaze direction 
based on head pose rotation and eye appearance distor¬ 
tion. The method was also only evaluated for person- 
dependent scenario. In [37], the authors collected a large 
gaze dataset with multiple head poses. Synthesized eye 
images were generated through 3D reconstruction of 
the eye region to provide more data for denser viewing 
angles. Then, a gaze estimation model was trained using
random forest on the synthesized images. Finally,
a person-independent evaluation was performed on the 
dataset. 

In the meantime, several datasets were released to 
the public for stationary displays. In [36], Smith et al. in¬ 
troduced a gaze dataset composed of 5,880 images from 
56 subjects. The images were recorded from a fixed 
distance to the subjects in a controlled environment, 
while they looked at each one of 21 pre-defined gaze 
locations. The gaze directions were coarsely arranged in 
seven horizontal by five vertical angles. Though five hor¬ 
izontal head poses were captured, the vertical head pose 
was fixed. Sugano et al. [37] collected a large dataset 
with 64,000 images from 50 subjects. The images have 
a much denser sampling of gaze angles, with 16 hori¬ 
zontal and 8 vertical gaze directions, and eight head 
poses. The images were also collected from a fixed dis¬ 
tance to the subjects in a controlled environment. A 
benchmark dataset was proposed in [24] for evaluation 
of the performance of different gaze tracking/estimation 
algorithms. The dataset contains videos recorded by 
both color and depth cameras, and features variation
in head pose, type of gazing target, and ambient
condition. However, the dataset included data from only 
16 subjects, and only 3 subjects’ data was recorded in 
two different ambient conditions. While all of the above 
mentioned datasets captured extensive amounts of head 
poses and appearances, the experiments were conducted 
in a tightly controlled manner and do not vary in body 
posture, which is different from our dataset that is more 
specifically targeting the mobile usage. 


2.2 Point of Gaze Estimation for Mobile Displays 

Only a few works discussed gaze estimation methods 
for mobile devices, and most of those works were ex¬ 
ploratory, directly applying previously presented meth¬ 
ods to mobile devices. In [7], the authors proposed using 
gaze gestures to control mobile phones, in comparison
with gaze dwell duration, and showed the potential to
improve gaze tracking accuracy by using gaze gestures. 
A commercial gaze tracker was utilized to locate the 
user’s gaze location on the phone screen. The change 
of gaze locations was then converted to gaze gestures. 
This paper studied only the usability of gaze gestures 
to control mobile devices based on gaze tracking results, 
not gaze tracking itself. Nagamatsu et al. [27] adopted 
the gaze tracking method proposed in [25], utilizing 
two cameras and two light sources to find the 3D gaze 
direction on a mobile phone. A one point personal cali¬ 
bration was used to find the offset between the optical 
and the visual axis. The system was claimed to work 
under free hand movement, but there was no quantita¬ 
tive evaluation presented. Kunze et al. [16] implemented 
an application on mobile tablets and phones to accu¬ 
mulate statistics about user’s reading behaviors. They 
compared the performance of one appearance-based and 
one geometry-based gaze tracking method, and reached 
the conclusion that both methods are highly dependent 
on not only the calibration phase but also the position 
in which the device was held. However, there was also 
no quantitative evaluation regarding the accuracy of 
the different methods. In [44], an on-device gaze track¬ 
ing prototype was implemented using a geometry-based 
gaze estimation method on an unmodified tablet. The 
algorithm fitted an ellipse to eye limbus within the 
region-of-interest (ROI) detected by eye detectors, and 
found the optical axis through the ellipse normal vector. 
No user calibration was performed to correct the error 
between the optical and the visual axis. The optical axis 
was directly treated as the gaze direction. An accuracy 
of 6.88° was claimed in the work. However, the method 
was evaluated only on 8 subjects, and subject-tablet 
distance was fixed in the experiments. Furthermore, the 


gaze locations included only 9 dots on the screen, covering
only part of the available tablet surface. Recently, Zhang
et al. [47] presented a gaze dataset collected under free
laptop use with 15 participants. The dataset contains 
213,659 images and has 20 gaze locations. An algorithm 
was also presented in the work utilizing multimodal 
convolutional neural networks (CNN) to predict gaze 
direction from head pose and eye appearance. Though
laptops are technically mobile devices, they have much
less mobility compared to handheld devices like tablets 
and phones. In addition, the statistics from this work 
showed that the majority of the data was collected dur¬ 
ing work time, when people would more likely put their 
laptops on the desk. A major consequence of this difference
is that the user’s face is fully visible, while that is
not the case for tablets, as is shown in our
work. Furthermore, the algorithm presented requires 
camera calibration and a pre-built facial shape model. 

Our work is the first to study unconstrained hand¬ 
held mobile device gaze estimation. Our gaze dataset 
was collected with free subject motion and different 
body postures, greatly capturing the appearance varia¬ 
tions in unconstrained environments. In addition, our 
algorithm is fully automatic and is developed based on 
the observations made on the Rice TabletGaze dataset.
Our study on the impact of practical factors on the algo¬ 
rithm performance such as prescription glasses and body 
posture, as well as our evaluation of continuous gaze 
tracking, help us understand mobile gaze estimation and 
its practicality. 


3 Rice TabletGaze Dataset 

We created the first publicly available unconstrained
mobile gaze dataset, the Rice TabletGaze dataset, to provide
data for our study of the unconstrained mobile gaze
estimation problem. We designed our data collection 
experiments to capture unique, unrestrained character¬ 
istics in the mobile environment. To this end, we have 
collected data from 51 subjects, each with four different 
body postures. The dataset is also released online to 
promote future research development of unconstrained 
gaze estimation methods. While all the data in this 
paper is recorded with one tablet, one could potentially 
train a gaze estimation model from this dataset, and the 
learned model can be used for gaze estimation on other 
handheld devices through approaches that use transfer 
learning, domain adaptation or by directly encoding 
the relative location and resolution of the cameras in 
the two devices. While we believe this is feasible, it is
outside the scope of this paper. 
3.1 Data Collection

In this research, we used a Samsung Galaxy Tab S 10.5 
tablet with a screen size of 22.62 x 14.14 cm (8.90 x 
5.57 inches). A total of 35 gaze locations (points) are 
equally distributed on the tablet screen, arranged in 5 
rows and 7 columns and spaced 3.42 cm horizontally and 
3.41 cm vertically. An example image of the gaze pattern
on the tablet screen is shown in Fig. 2. The raw data
are videos captured by the front-camera of the tablet 
that was held in landscape mode by the subjects, with 
an image resolution of 1280 x 720 pixels. 
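The layout of the 35 gaze locations can be reproduced with a short sketch. The screen size, grid shape and spacing come from the text; centering the grid on the screen and placing the origin at the top-left corner are assumptions made here for illustration.

```python
SCREEN_W_CM, SCREEN_H_CM = 22.62, 14.14  # screen size from the text
COLS, ROWS = 7, 5                        # 7 columns by 5 rows
DX_CM, DY_CM = 3.42, 3.41                # horizontal / vertical spacing


def gaze_grid():
    """Return the 35 gaze locations as (x, y) in cm, row by row.

    Assumption: the grid is centred on the screen and the origin is the
    top-left corner; the text only states the spacing and the grid shape.
    """
    margin_x = (SCREEN_W_CM - (COLS - 1) * DX_CM) / 2
    margin_y = (SCREEN_H_CM - (ROWS - 1) * DY_CM) / 2
    return [(margin_x + c * DX_CM, margin_y + r * DY_CM)
            for r in range(ROWS) for c in range(COLS)]
```

Under the centering assumption, the grid spans 6 × 3.42 = 20.52 cm horizontally and 4 × 3.41 = 13.64 cm vertically, leaving margins of about 1.05 cm and 0.25 cm respectively.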

A total of 51 subjects, 12 female and 39 male, partic¬ 
ipated in the data collection, with 26 of them wearing 
prescription glasses; 28 of the subjects are Caucasians, 
and the remaining 23 are Asians. The ages of the subjects
range approximately from 20 to 40 years. Institutional
review board (IRB) approval was obtained
for the research, and all subjects signed a consent form to
allow their data to be used in the research and released 
online. 

During each data collection session, the subject held 
the tablet in one of the four body postures (standing, 
sitting, slouching or lying) as shown in Fig. 3, and 
recorded one video sequence. Each subject needed to 
conduct four recording sessions for each of the four body 
postures, so a total of 16 video sequences were collected 
for each subject. For each recording session, there was 
no restriction on how the subject held the tablet or how 
they performed each body posture. The data collection 
happened in a naturally lit office environment, where 
only the ceiling lights directly on top of the subjects 
were turned off to reduce the strong background light 
in the recorded videos. 

When a subject started one data collection session, 
he or she initialized a background recording applica¬ 
tion on the tablet, so the front facing camera of the 
tablet began recording a video of the subject with audio. 
Then the subject started to play and watch a video on 
the tablet. A beep sound notified the beginning of the 
video, which was also recorded in the video sequence. 
The recorded sound would be utilized later to locate 
the time instant in the recorded video when the subject 
started to watch the video. The video watched by the 
subjects consists of a dot changing its location every 
three seconds, and the subject was instructed to focus 
his/her eyes on the dot the whole time. The subject 
was free to blink his/her eyes, as it would be uncomfortable
to restrain the eye blink in each approximately two
minute long data collection session.

[Figure 2: grid of gaze points on the 22.62 cm × 14.14 cm screen, with adjacent points about 3.4 cm apart]

Fig. 2 Gaze locations on the tablet screen. There are 35 (5
× 7) locations distributed on the tablet screen. In one data
collection session, a dot appeared at one location at a time,
and then moved to another location after 3 seconds. This
continued until the dot had appeared at all the 35 locations
once. The location of the dot was randomized among the 35
points.

Fig. 3 An example image of the data collection process. In
one data collection session, a subject maintains one of four
body postures while gazing at a dot on the tablet screen.
At the same time, a video of the subject is recorded by the
tablet front camera. From left to right, the subject is standing,
sitting, slouching and lying.

To prevent the subject from focusing his/her eyes on the next gaze point ahead
of time (i.e., predicting the dot location), the location of
the dot was randomized among the 35 possible points.
Sample images from the dataset are shown in Fig. 4.
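The randomized presentation order can be sketched as a shuffled schedule in which every location is shown exactly once for 3 seconds. The function name and the seeding mechanism are hypothetical; the text does not describe how the stimulus video was generated.

```python
import random


def dot_schedule(n_points=35, dwell_s=3.0, seed=None):
    """Return (onset_time_s, point_index) pairs for one session.

    Each of the n_points locations appears exactly once, in random order,
    and stays on screen for dwell_s seconds. The seed parameter is an
    illustrative addition so that a schedule can be reproduced.
    """
    order = list(range(n_points))
    random.Random(seed).shuffle(order)
    return [(k * dwell_s, idx) for k, idx in enumerate(order)]
```

With 35 points and a 3-second dwell, the stimulus lasts 35 × 3 = 105 seconds, which is consistent with the approximately two minute session length.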


3.2 Observations on the Rice TabletGaze Dataset 

In this section, we discuss our observations about facial 
visibility, body posture and prescription glasses, based 
on our TabletGaze dataset described in Section 3.1. 

Observation 1: The entire face may not be 
visible in most of the image frames. 

Fig. 5 shows an example of full range of facial visibil¬ 
ity for the same subject during different data collection 
sessions. The images vary from full facial visibility to 
only the subject’s forehead being visible. To quantify the 
extent of facial visibility, we labeled each video in the 
TabletGaze dataset as belonging to one of the following 
five categories: (i) the whole face; (ii) from mouth and 
above; (iii) from nose and above; (iv) from eyes and 
above; and (v) even the eyes are not visible.

[Figure 4 column labels: Standing, Sitting, Slouching, Lying]

Fig. 4 Sample images from the Rice TabletGaze dataset. We observe subject appearance variations across different recording
sessions. Also, notice that only a fraction of the subject’s face is visible in a fraction of the images in the dataset.

For each video sequence, we manually reviewed 4 images (each
image corresponds to 1 of the 4 corner gaze locations on
the tablet screen) and determined the facial visibility
extent of each image. The video sequence is labeled with
the majority category of the 4 images. The statistics
based on the above categorization are shown in Table 1.
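The per-video labeling rule can be sketched as a majority vote over the four reviewed frames. The category names and the tie-breaking rule (a tie goes to the less visible category) are assumptions introduced here; the text does not say how ties were resolved.

```python
from collections import Counter

# Ordered from most to least visible; the short names are illustrative.
CATEGORIES = ["whole", "mouth", "nose", "eyes", "no_eyes"]


def label_video(frame_labels):
    """Return the majority visibility category among the reviewed frames.

    Assumption: when two categories tie, the less visible one is chosen.
    """
    if not frame_labels:
        raise ValueError("at least one frame label is required")
    unknown = set(frame_labels) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    counts = Counter(frame_labels)
    best = max(counts.values())
    tied = [c for c in CATEGORIES if counts.get(c, 0) == best]
    return tied[-1]  # last in CATEGORIES order = least visible
```

For instance, three "whole" frames and one "mouth" frame yield "whole", while a two-two split between them yields "mouth" under the assumed tie-break.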

We observe that the whole face is visible in only 
30.8% of all the videos, and the number varies from one 
posture to another, with sitting being the highest (47%) 
and lying being the lowest (13.7%). It is clear that in a 
strong majority of the videos, full facial visibility cannot 
be assumed. 

The extent of facial visibility directly affects the 
amount of information that can be extracted from the 
facial region for gaze estimation. For example, head 
pose information (pitch, yaw and roll angles) can be 
estimated from the face, and can be used in conjunc¬ 
tion with eye appearance information to improve gaze 
estimation. The details were discussed in Section 2. The 
bulk of previously proposed head pose estimation methods
[4, 34, 40] require the whole face to be visible, and are
not effective when only part of the face is visible. Due
to a lack of robust methods for estimating head pose
from a partially visible face, we largely focus
on eye region appearance in this paper. However, we did 
perform preliminary work to incorporate implicit head 
pose information, such as eye locations in the image 
frame, as discussed in detail in Section 6. In addition, 
methods that incorporate head pose information for 
those frames where the entire face is visible will poten¬ 
tially improve gaze estimation accuracy, though such an 
investigation is outside the scope of this paper. 

Observation 2: Body posture and facial visi¬ 
bility extent appear to be correlated. 



Fig. 5 Example images of the five facial visibility
categories. From left to right, each image represents
one of the following five visibility categories: i) the whole face,
ii) from mouth and above, iii) from nose and above, iv) from
eyes and above, and v) where even the eyes are not visible. 
For clarity of presentation, we have cropped the background. 

Table 1 Statistics on the extent of the visible face region. 
Each video in the dataset is labeled as one of the five facial 
visibility categories. The numbers in the table are percentage 
of videos. Note that the whole face is only visible in 30.8% of 
all the videos. Based on this data, we can infer that most of 
the time the whole face is not visible. 


Posture / Visibility   Whole    Mouth    Nose     Eyes     No Eyes
Standing               39.2%    38.2%    18.6%    4.0%     0%
Sitting                47.0%    27.5%    19.1%    5.9%     0.5%
Slouching              23.0%    35.8%    26.0%    13.2%    2.0%
Lying                  13.7%    39.7%    35.3%    7.4%     3.9%
All body postures      30.8%    35.2%    24.8%    7.6%     1.6%

Our starting hypothesis was that there might be 
a correlation between facial visibility extent and body 
posture during tablet use. Two main conclusions can 
be derived from Table 1. First, when seeking a refined 
amount of information about facial visibility, body pos¬ 
ture information can be useful. For example, stand¬ 
ing/sitting postures lead to higher probabilities of the 
face being fully visible, compared to slouching/lying. 
[Figure 6 panels, left to right: No Reflection, Weak Reflection, Strong Reflection]

Fig. 6 Example images of different glasses reflection strengths.
From left to right, the images show no reflection, weak
reflection and strong reflection, respectively.

Table 2 Statistics on eyeglasses reflection strength. We can
infer that prescription eyeglasses cause reflection in approximately
half of the videos.

Reflection strength     None     Weak     Strong
Percentage of videos    49.5%    24.2%    26.3%


Intuitively, the observations make sense based on prac¬ 
tical experience. Most users tend to rest their tablets 
on their chest/abdomen when slouching/lying, which 
reduces chances of seeing the whole face. Although this 
is beyond the scope of this paper, facial visibility extent 
could thus potentially be used to roughly estimate the 
body posture. 

Second, if the only objective is to see the eyes, then 
the eyes are visible in at least 96% of the videos for 
any posture. Thus, for our proposed appearance-based 
method discussed in Section 4, which relies on the visi¬ 
bility of the eyes only, information about body postures 
is not essential. However, for methods that may rely on 
other facial landmarks, the accuracy of gaze estimation 
could be dependent on the body posture. 
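The 96% figure follows directly from the "No Eyes" column of Table 1, as the short calculation below shows. The dictionary keys are illustrative names; the percentages are those reported in the table.

```python
# Percentage of videos in which even the eyes are not visible (Table 1).
NO_EYES_PERCENT = {
    "standing": 0.0,
    "sitting": 0.5,
    "slouching": 2.0,
    "lying": 3.9,
}

# Eyes are visible in every video that is not in the "No Eyes" category.
eyes_visible = {posture: 100.0 - pct for posture, pct in NO_EYES_PERCENT.items()}

# The posture with the lowest eye visibility bounds the claim from below.
worst_posture = min(eyes_visible, key=eyes_visible.get)
```

The lowest value occurs for lying, where the eyes are visible in 100 − 3.9 = 96.1% of the videos.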

Observation 3: Prescription glasses can cause 
reflection, and in many instances, the reflection 
can be significant. 

Fig. 6 shows examples of eyeglasses reflections from 
the TabletGaze dataset. Depending on the viewing angle, 
light source, orientation and coating, there may be no 
glare from the eyeglasses (left most image in Fig. 6) or 
very strong glare (right most image in Fig. 6). 

To quantify how often reflection happens and how 
strong the reflection is, we accumulated information 
on the occurrence and strength of eyeglass reflections
in the eye images. We categorized the videos into
three broad categories (no reflection, weak reflection
and strong reflection) by the same method we used for
face visibility categorization. The categorization was done
for all the videos of subjects who were wearing glasses,
and the statistics are listed in Table 2. We observe that
there is visible glasses reflection in half the videos, and
in 26.3% of the videos, there is a strong reflection. Reflections
with strong intensities could potentially impact
the gaze estimation accuracy by i) possibly confusing the
eye detector used in our algorithm, making it return an
erroneous bounding box location around the eye region,
and ii) reducing the contrast in some regions of the eye,
which in turn makes part of the eye, such as the iris or sclera,
invisible.

3.3 Sub-dataset Labeling

The total amount of raw data collected is 51 × 16 = 816
video sequences. However, a portion of the data is not
usable for three reasons: i) the transition from one gaze
point to the next and loss of concentration of subjects
produce image frames with inconsistent gaze labels; ii)
eye detector failure in some conditions causes missing
data; and iii) involuntary eye blinks and large motion
blur result in images without useful gaze information
from the eyes. For these three reasons, we prune
the raw data and obtain a sub-dataset of 41 subjects
to be used in our experiments. Below, we explore the
three reasons in more detail and describe how we filter
out the unusable data.

We first remove images with inconsistent gaze labels. 
We extract only the video chunk that corresponds to 
1.5 to 2.5 seconds after the time the dot appears at a 
new location to remove the time for subjects to re-focus. 
Since it is unavoidable that sometimes the subject loses 
concentration during a data collection session, the gaze 
label of parts of the corresponding video data can be 
mismatched. For the 35 video chunks extracted from 
each video sequence, we visually inspect whether there 
is a gaze drift for more than 5 video chunks and, if so, 
abandon the data from the whole video sequence. Since 
it is hard to determine the true gaze location just by 
looking at one stand-alone image, we extract one eye re¬ 
gion image for each gaze point and enhance the contrast 
of the image to compensate for the low illumination sce¬ 
nario. By comparing the relative location of the iris and 
openness of the eyes among 35 gaze locations, we are 
able to identify each gaze drift occurrence and calculate 
the total number of gaze drifts. 
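The chunk extraction and the sequence-level rejection rule can be sketched as follows. The sketch assumes that the first dot appears at the time of the recorded beep and that dots change exactly every 3 seconds; the function names are hypothetical, and the drift flags stand in for the visual inspection described above.

```python
def chunk_windows(beep_time_s, n_points=35, dwell_s=3.0,
                  start_offset_s=1.5, end_offset_s=2.5):
    """Return (start_s, end_s) of the usable chunk for each gaze point.

    Times are positions in the recorded video. Assumption: the k-th dot
    appears at beep_time_s + k * dwell_s.
    """
    windows = []
    for k in range(n_points):
        onset = beep_time_s + k * dwell_s
        windows.append((onset + start_offset_s, onset + end_offset_s))
    return windows


def keep_sequence(drift_flags, max_drifts=5):
    """Keep a video sequence unless more than max_drifts chunks show drift.

    drift_flags holds one boolean per chunk, produced by manual inspection.
    """
    return sum(bool(f) for f in drift_flags) <= max_drifts
```

Each window is 1 second long, so the first 1.5 seconds after a dot moves (re-focusing time) and the last 0.5 seconds before it moves again are discarded.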

We then remove images with eye detector failures.
Each video chunk lasts 1 second and contains between 15
and 30 frames, due to the variable video recording rate of
the front camera. An important step for automatic
estimation of gaze from images is to detect the eye region
using an eye detector, which fails under conditions such
as eyes not visible in the image frame, strong reflection
from prescription glasses, occlusion from hair, and poor
illumination. Images with eye detector failures are
removed, resulting in a small data size for certain
subjects. Examples of eye detection success and failure
cases are shown in Fig. 8.

Another source of images without useful information
is the involuntary blinking and occasional large motion
of the subjects during the data collection stage. Since
images of closed eyes and blurred eye regions are
undesired, from the image frames within each video chunk
corresponding to one gaze direction we extract the 5
images with lower mean intensity value and higher mean
Laplacian of Gaussian (LoG) value. We do this because
images of closed eyes have a higher mean intensity
value given the disappearance of the dark pupil, and a
blurred eye region image has a lower mean LoG
value because motion blur weakens the edge information
in the image. Even when a video chunk contains no
closed eye images, we still extract 5 image frames to
guarantee a similar number of data samples for each
gaze point.
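
The frame selection rule above can be sketched as follows. This is
a minimal illustration, not the procedure used to build the
sub-dataset: it assumes the mean intensity and mean LoG response of
every eye image in a chunk are already computed, and it combines the
two criteria by summing their ranks, which is an assumption because
the text does not state how they are combined. The name
select_frames is hypothetical.

```python
def select_frames(mean_intensity, mean_log, n_keep=5):
    """Pick n_keep frames that are dark (open eye) and sharp (high LoG).

    mean_intensity[i] and mean_log[i] describe frame i of one video chunk.
    Combining the two criteria by rank sum is an assumption of this sketch.
    """
    n = len(mean_intensity)
    # Rank 0 = darkest frame (dark pupil visible, so the eye is likely open).
    by_dark = sorted(range(n), key=lambda i: mean_intensity[i])
    # Rank 0 = sharpest frame (largest mean LoG response, so least blur).
    by_sharp = sorted(range(n), key=lambda i: -mean_log[i])

    rank = [0] * n
    for r, i in enumerate(by_dark):
        rank[i] += r
    for r, i in enumerate(by_sharp):
        rank[i] += r

    # Smallest combined rank wins; ties are broken by frame order.
    best = sorted(range(n), key=lambda i: (rank[i], i))[:n_keep]
    return sorted(best)
```

With 8 frames in which one frame is bright (a blink) and another has
a very low LoG value (blur), the function keeps 5 frames and drops
both of those.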

This extensive data selection process removes most 
of the unusable images. The tiny fraction of bad images 
that escape this procedure is treated as noise. 

4 TabletGaze: Gaze Estimation System 

In this section, we describe the proposed TabletGaze, a
gaze estimation framework that leverages well-known
machine learning processing modules, as shown in Fig.
7. The estimation of gaze from an image consists of
three parts: preprocessing, feature extraction and
regression estimation. The preprocessing part involves
image normalization (e.g. scaling) so the eyes from
different images can be directly compared. For feature
extraction and regression, we utilize a data-driven
approach to guide the selection of features and regressors.
We tested five features: contrast normalized intensities,
Laplacian of Gaussian (LoG), Local Binary Patterns
(LBP) [30], Histogram of Oriented Gradients (HoG) [5],
and multilevel HoG (mHoG) [21]. We utilized four
regressors, namely k-Nearest Neighbors (k-NN), Random
Forests (RF) [2], Gaussian Process Regression (GPR)
[42], and Support Vector Regression (SVR) [8].

4.1 Preprocessing 

The first step in TabletGaze is to preprocess the input
images, which have a resolution of 1280 x 720 pixels.
An example of the preprocessing step is displayed in Fig.
9. We first apply two Haar feature CART-tree based
cascade detectors [46], one trained for the left eye and
one for the right eye, to locate image patches that include
potential left and right eye regions. A sample output of
the detectors can be found in Fig. 9. False positive
bounding boxes from the detectors are rejected by 1)
empirically establishing a threshold for the size of the
box to remove small false-positive patches, such as the
nostril detected in Fig. 9, and 2) enforcing coarsely
symmetric locations of the bounding boxes returned by the
left and right eye detectors (coarsely, to allow for head
tilt, where the eyes are not totally symmetric) to remove
stand-alone false-positive patches, such as the mouth
detected in Fig. 9. The eye region bounding box sizes
vary across images, so the boxes are scaled to 100 x 100
pixels. The detected bounding box contains a large area
including the eyebrows, which is not informative about
gaze, so we crop a tight box around the eye to obtain the
final eye image. The pupil center is coarsely located at
one half horizontally and two thirds vertically of the
bounding box, given that the aforementioned eye detector
was trained with eye images of this geometry. We keep 15
pixels above and 15 pixels below the pupil center to form
the final eye image, which covers the eye region tightly
for most subjects. The horizontal dimension is untouched
since the eye width varies widely among subjects. As
a result of these operations, the final eye image has a
fixed size of 30 x 100 pixels for each eye across all
images. A few sample images of the cropped eye regions
are shown in Fig. 10.

Fig. 7 Automatic gaze estimation algorithm. The input to the
algorithm is an image of the user recorded by the tablet front
camera. The output is the location on the tablet screen at
which the user is looking. The estimation of gaze from an image
consists of preprocessing, feature extraction and regression
estimation.

Fig. 8 Eye detection fails in some scenarios. The top row of
images, (a-c), shows correct eye detection cases. The bottom
row shows eye detection failure cases, including (d)
strong glasses reflection, (e) hair occlusion over an eye, and
(f) poor illumination.

Fig. 9 Example images in the preprocessing phase. First,
two eye detectors are applied to detect potential left and right
eye regions. The blue bounding boxes denote the output of
the left eye detector, and the red ones that of the right eye
detector. In the example image, we can observe false positive
image patches around the nostril and mouth, which are removed
to find the true eye region. Then only a tight region around
the eyes is used to avoid the ambiguity caused by eyebrows and
facial expressions.

Blinks cause the gaze estimation algorithm to produce
incorrect predictions and need to be removed. To detect
blinks, the algorithm looks for changes in the mean pixel
intensity of the eye region over time. The algorithm takes
advantage of the fact that when an eye blink occurs, the
disappearance and reappearance of the dark pupil results
in an increase, then a decrease, of the mean pixel intensity.
The mean is taken over 20 consecutive frames, which
is usually less than the time between two consecutive
blinks. By inspecting the video sequences, we
found that an eye blink usually lasts around 4-6 frames.
Therefore, when a blink is detected, we skip 6 frames
around the peak frame.
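
The blink rule described above can be sketched as follows. This is
a minimal illustration under stated assumptions, not the detector
used in TabletGaze: the input is the mean eye-region intensity of
each frame, a blink is flagged when a local intensity peak exceeds
the mean of a 20-frame window by a relative margin, and the 10%
margin (rel_threshold) and the function name blink_frames are
hypothetical choices that the text does not specify.

```python
def blink_frames(intensity, window=20, rel_threshold=0.10, skip=6):
    """Return the set of frame indices to drop because of a suspected blink.

    intensity[t] is the mean pixel intensity of the eye region in frame t.
    A blink shows up as a rise then fall of that value (the dark pupil
    disappears and reappears), so we look for a local peak that stands
    out from the mean over `window` consecutive frames.
    """
    dropped = set()
    n = len(intensity)
    for t in range(1, n - 1):
        # Window of up to `window` frames, roughly centered on frame t.
        lo = max(0, t - window // 2)
        hi = min(n, lo + window)
        lo = max(0, hi - window)
        baseline = sum(intensity[lo:hi]) / (hi - lo)

        is_peak = intensity[t] >= intensity[t - 1] and intensity[t] > intensity[t + 1]
        if is_peak and intensity[t] > baseline * (1 + rel_threshold):
            # Skip `skip` frames around the peak frame.
            start = max(0, t - skip // 2)
            dropped.update(range(start, min(n, start + skip)))
    return dropped
```

For a flat signal the function returns an empty set; for a single
intensity bump it returns the 6 frames surrounding the peak.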


4.2 Feature Calculation 

Following eye extraction, we next compute features. Feature
calculation includes two steps: feature extraction and
dimensionality reduction.

Fig. 10 A few sample images of the final extracted eye data.
Each row of eye images comes from one subject. We observe
that after the preprocessing step, the eyes are tightly cropped.

Feature Extraction: The accuracy of gaze estimation
greatly depends on the feature we choose. To
ensure our algorithm achieved a state-of-the-art result,
we chose to evaluate the performance of 5 popular
features: (1) contrast normalized pixel intensities; (2) LoG;
(3) LBP; (4) HoG; and (5) mHoG. The first
feature, contrast normalized pixel intensities,
is the simplest of the five; it converts pixel values
into the feature vector after normalization to account for
variations in illumination. The LoG feature convolves each
eye image with a LoG filter and uses the vectorized filter
response, which enhances the eye contour and removes
person-dependent eye region texture information. LBP and
HoG have proven to be powerful features in many works [6].
LBP captures image texture information, while HoG retrieves
local shape and orientation information. As a variant of
HoG, multilevel HoG (mHoG) is formed by concatenating
HoG features at different scales. The block scales
utilized in this paper are the same as presented in [22].
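
The idea of concatenating HoG features at several scales can be
sketched as follows. This is a simplified illustration, not the
mHoG implementation of [22]: gradients come from central
differences, orientations are unsigned and quantized into 9 bins,
each block histogram is L2-normalized, and the grid sizes in
`scales` are hypothetical placeholders for the block scales, which
are not listed in the text.

```python
import math


def hog_blocks(image, grid_rows, grid_cols, n_bins=9):
    """Orientation histograms over a grid_rows x grid_cols grid of blocks."""
    h, w = len(image), len(image[0])
    hists = [[0.0] * n_bins for _ in range(grid_rows * grid_cols)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = image[y][x + 1] - image[y][x - 1]   # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % math.pi     # unsigned orientation
            b = min(int(angle / math.pi * n_bins), n_bins - 1)
            block = (y * grid_rows // h) * grid_cols + (x * grid_cols // w)
            hists[block][b] += mag                   # magnitude-weighted vote
    feature = []
    for hist in hists:
        norm = math.sqrt(sum(v * v for v in hist)) or 1.0
        feature.extend(v / norm for v in hist)       # L2-normalize each block
    return feature


def mhog(image, scales=((1, 2), (2, 4), (3, 6)), n_bins=9):
    """Concatenate block histograms computed at several grid scales."""
    feature = []
    for rows, cols in scales:
        feature.extend(hog_blocks(image, rows, cols, n_bins))
    return feature
```

For a 30 x 100 eye image and the placeholder scales above, the
feature has (2 + 8 + 18) x 9 = 252 entries.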

Dimensionality Reduction: Features obtained in
the feature extraction phase are high dimensional and
contaminated by noise. We overcome these problems by
mapping the features to a lower dimensional space. In
this work, we applied Linear Discriminant Analysis (LDA)
to reduce the feature dimensionality. LDA maps the data
to a lower dimensional space where the ratio of inter-class
scatter to intra-class scatter is maximized. Finding the
projection vectors requires computing the inverse of the
intra-class scatter matrix, which becomes singular
when the number of data samples per class is
smaller than the number of features. To avoid this,
we first applied Principal Component Analysis (PCA) to
the original feature data to reduce its dimension. The
dimension is reduced to no smaller than the number of
observations per gaze point. Then we apply LDA to the
already reduced data to obtain a final feature vector.
Given input data with C classes, the output data
of LDA will have a length of C - 1. In our dataset, we
have gaze data corresponding to 35 gaze locations, so
the final data after the LDA operation has a feature
length of 34.
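
For reference, the standard LDA criterion behind the description
above can be written as follows. The notation is an assumption
introduced here and does not appear in the text: x_i is a
(PCA-reduced) feature vector, I_c is the index set of the n_c
samples with gaze location c, mu_c is the mean of class c, and mu
is the overall mean.

```latex
S_w = \sum_{c=1}^{C} \sum_{i \in \mathcal{I}_c}
      (x_i - \mu_c)(x_i - \mu_c)^{\top},
\qquad
S_b = \sum_{c=1}^{C} n_c \, (\mu_c - \mu)(\mu_c - \mu)^{\top},

W^{*} = \arg\max_{W}
        \frac{\left| W^{\top} S_b W \right|}{\left| W^{\top} S_w W \right|},
\qquad
\operatorname{rank}(S_b) \le C - 1 .
```

The columns of W* are the leading eigenvectors of S_w^{-1} S_b,
which is why S_w must be invertible, and the rank bound on S_b is
why C = 35 gaze locations give at most 34 output dimensions.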


4.3 Regression 

Finally, after computing the final feature vectors, the
data is fed into a regression model. The gaze labels of
the data include two parts: the horizontal and vertical
(x and y) coordinates on the tablet screen. We trained
a separate regressor for the horizontal and vertical gaze
locations respectively. Then the outputs from the two
regressors are combined as the predicted 2D gaze location
on the tablet screen. In our work, we experimented with
the four different models mentioned earlier. k-Nearest
Neighbors (k-NN) assigns the average of the output of
the k nearest neighbors in the training data to a new
observation; we chose k = 3 in our experiments.
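
The k-NN regressor and the use of one regressor per screen
coordinate can be sketched as follows. This is a minimal
illustration that assumes Euclidean distance between feature
vectors (the distance measure is not stated in the text); the
function names are hypothetical.

```python
import math


def knn_predict(train_features, train_targets, query, k=3):
    """Average the targets of the k training samples nearest to query."""
    order = sorted(range(len(train_features)),
                   key=lambda i: math.dist(train_features[i], query))
    nearest = order[:k]
    return sum(train_targets[i] for i in nearest) / len(nearest)


def predict_gaze(train_features, train_xy, query, k=3):
    """Predict (x, y) on the screen with one regressor per coordinate."""
    xs = [p[0] for p in train_xy]   # horizontal gaze labels
    ys = [p[1] for p in train_xy]   # vertical gaze labels
    return (knn_predict(train_features, xs, query, k),
            knn_predict(train_features, ys, query, k))
```

The same two-regressor pattern applies to the other models; only
knn_predict would be swapped for the corresponding regressor.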

Random Forests (RF) are a set of weak binary tree
regressors. Each tree in the forest is grown on a random
bootstrap sample of the data, and each binary split of the
tree is chosen from a randomly selected subset of the
features. For regression RF, the output for a new input is
given by the average of the outputs of the trees in the
forest. RF has previously been used in gaze estimation
papers and shows strong performance [37, 45]. In our
experiments, we used 100 trees.

Gaussian Process Regression (GPR) models the
regression problem as a Gaussian process and estimates
the output of a new observation by taking the conditional
probability over the training data. The advantage
of GPR is that it returns not only the estimate of the
output, but also the confidence interval of the estimate.
However, traditional GPR has a complexity of O(N^3)
for an input data sample size N [33], which makes it
computationally infeasible for a large dataset, such as
the over 100,000 samples in our data. In our experiments,
we used the fully independent training conditional
(FITC) approximation [28], a sparse GPR method which
claims to achieve accuracy similar to GPR, to reduce
the running time. Even with the faster FITC approximation,
we could only manage to evaluate on 15 subjects
within a reasonable computing time using three-fold cross
validation.

Support Vector Regression (SVR) utilizes the well-known
"kernel trick" to project data into a higher dimensional
space where a linear regression function can
effectively fit the data. A nonlinear kernel can transform
a nonlinear regression problem in the original
data space into a linear one in the new space. In our
experiments, we employed the popular nonlinear radial
basis function (RBF) kernel. The performance of SVR
depends highly on the model parameters, which are
usually obtained through a coarse-to-fine grid-search
process. Given a data sample size N, SVR has a training
time complexity of O(N^3) [39], which greatly limits its
scalability to large datasets. In our experiments, we
evaluated SVR on the subset of 15 subjects that was used
in the GPR evaluation. The evaluation was also conducted
using three-fold cross validation.


5 Results and Analysis 

5.1 Error Metric 

Previous works on gaze estimation employed an angular
error to evaluate the quality of gaze estimation. The
angular error is computed by taking the arctangent of
the ratio between the distance from the subject's eyes to
the screen center and the distance of the gaze point from
the screen center. However, in the mobile environment,
the distance between the screen and the user is highly
variable, so it is not possible to reliably calculate the
angular error. For our work, since we have the ground
truth gaze labels (2D location on the tablet screen) for all
data, we define the estimation error of one data sample
as the Euclidean distance between the predicted 2D gaze
point location and the actual 2D gaze point location
on the tablet screen. The final error is reported as the
mean error (ME) over all data samples.
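
The error metric defined above amounts to the following
computation. This is a minimal sketch; the function name
mean_error is hypothetical, and the points are (x, y) screen
locations in cm.

```python
import math


def mean_error(predicted, actual):
    """Mean Euclidean distance (cm) between predicted and true gaze points."""
    errors = [math.dist(p, a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)
```

For example, one prediction that is off by 3 cm horizontally and
4 cm vertically (a 5 cm error) and one exact prediction give a
mean error of 2.5 cm.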


5.2 Comparisons for Different Features + Regressors

In Table 3, we first summarize the performance of each
feature and regressor as described in Sections 4.2 and
4.3. The entries in Table 3 are the MEs (in cm) across
around 100,000 images from 41 subjects using the cross
validation method described in Section 4.3. The columns
of Table 3 represent different features. The features are
listed in order of increasing complexity, and the table
shows that estimation accuracy generally increases with
feature complexity regardless of the regressor used. Note
that the complex texture feature, LBP, performs no better
than the simple edge feature, LoG, and delivers far
inferior performance compared with the complex shape and
edge features, HoG and mHoG. We hypothesize that LBP
brings no improvement because shapes and edges, such as
those from the limbus and sclera, communicate more
information about the gaze location than texture does.
mHoG and HoG achieve the best results, with mHoG
performing slightly better than HoG, while the other
features yield far worse results. Moreover, the
computation of the mHoG feature is fast due to the use of
integral histograms [32].

The rows of Table 3 represent different regressors.
We notice that the best two results both come from
the RF regressor. In addition, RF provides fast
prediction, and has thus been widely adopted in real-time
systems [9]. In our experiments, we found that the
results were reasonably stable when using more than 20
trees; we used 100 trees to further improve accuracy.

Overall, mHoG and RF achieve the lowest error of
3.17 ± 2.10 cm, as listed in Table 3. A few example
images with high estimation error are shown in Fig. 11.
Even considering the computational complexity (e.g. for
real-time applications), mHoG and RF are still
recommended for their relatively fast computation. This is
our chosen algorithm for the experiments in the following
sections.
Fig. 11 Example eye images with different gaze estimation
errors. We show 10 pairs of eyes each for low (0.1 to 1.5 cm),
medium (3.2 to 4.7 cm), and high (7.0 to 9.8 cm) estimation
errors using the mHoG + RF gaze estimation algorithm. We can
observe that factors such as erroneous eye region cropping,
long eyelashes, strong reflections from prescription glasses,
glass frames, rotated eyes, and motion blur can reduce
estimation accuracy.


Table 3 Mean error (cm) for each feature and regressor
combination. Note that the combination of mHoG feature and
RF regressor achieved the lowest error.

Regressor | Raw pixels | LoG  | LBP  | HoG  | mHoG
k-NN      | 9.26       | 6.45 | 6.29 | 3.73 | 3.69
RF        | 7.20       | 4.76 | 4.99 | 3.29 | 3.17
GPR^1     | 7.38       | 6.04 | 5.83 | 4.07 | 4.11
SVR^2     | X          | X    | X    | X    | 4.07

^1 Due to the training time complexity constraint, GPR is
evaluated using three-fold cross validation on data of 15
subjects, which is essentially leave-5-subjects-out.

^2 SVR is evaluated only on the optimal feature, which is
mHoG. The evaluation process is conducted in the same way
as for GPR.


5.3 Person-Dependent and Person-Independent
Performance Comparison

In prior appearance-based gaze estimation methods [38,
43], the evaluation process of the algorithm uses data from
the same subject and same session for both training and
testing (person and session dependent). Here, we study
only the influence of person-dependency, not
session-dependency, on algorithm performance. The analysis
of session-dependency is not useful because in daily use, a
person's appearance can vary widely between sessions. In
the person-dependent model, the performance is evaluated
using leave-one-session-out cross validation on the data
from the same person (each person has 16 sessions).
In the person-independent model, leave-one-subject-out
cross validation is employed. In each of the
41 evaluation rounds (the TabletGaze sub-dataset includes
41 subjects), the regressor is trained on data from 40
subjects and tested on the remaining subject, and
the final result is obtained by averaging the
estimation errors of all the images from the 41 subjects.
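
The leave-one-subject-out protocol can be sketched as a generator
of train/test index splits. This is a minimal illustration; the
name leave_one_subject_out and the representation of the data as
one subject identifier per image are assumptions.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per subject.

    subject_ids[i] is the subject that image i belongs to. Every image of
    the held-out subject goes to the test set, so no person appears in
    both training and testing within a round.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

With 41 subjects this produces 41 rounds, each training on the
images of 40 subjects. Leave-one-session-out evaluation follows
the same pattern with session identifiers of a single person in
place of subject identifiers.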

Fig. 12(a) shows the estimation error histograms
over all the images in the sub-dataset. For the
person-dependent training scenario, the estimation errors
concentrate at lower values than for the
person-independent training scenario, that is, the
person-dependent estimation error is lower. The numerical
MEs over all samples in the sub-dataset are shown in
Fig. 12(b) for the two training scenarios. This result is
expected because the regressor will have better
generalization power for images from the same person, due
to the stronger similarity between the images.

We also present the stand-alone horizontal and vertical
errors (x and y coordinates on the tablet screen), in
addition to the overall/combined ME, for both
person-dependent and person-independent training scenarios
in Fig. 12(b). The horizontal and vertical errors are
both evaluated using mean absolute error (MAE) to
avoid the cancellation of positive and negative errors.
Unidirectional gaze estimation might be useful for
applications that require information from only a single
direction, such as web-page scrolling. We observe that
the horizontal and vertical errors are similar, showing
that the horizontal and vertical regressors have similar
predictive powers.

Fig. 12 Person-independent and person-dependent training
performance comparison: (a) error histograms; (b) bar plot
comparison, including the horizontal MAE and vertical MAE,
for the person-independent and person-dependent scenarios.
In person-independent training, leave-one-subject-out cross
validation was utilized to evaluate the algorithm, while in
person-dependent training, leave-one-session-out cross
validation was employed.


5.4 Comparison with Prior Results 

In this section, we compare our results with those
reported in previous works. In order to conduct this
comparison, we convert our distance error into angular
error. We find that the distance between the subject's eyes
and the tablet varied from 30 to 50 cm. For the sake
of comparison, we compute the error for that range of
distances. Given the error distance on the tablet screen E
and the tablet-to-eyes distance D, the angular error of
the algorithm is derived by calculating the arctangent
of the ratio of the error distance to the distance from
the user to the tablet, arctan(E/D).
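
The conversion from on-screen distance error to angular error can
be checked numerically. This is a minimal sketch; the function
name angular_error_deg is hypothetical, while E and D follow the
definitions in the text.

```python
import math


def angular_error_deg(error_cm, distance_cm):
    """Angular error in degrees: arctan(E / D)."""
    return math.degrees(math.atan(error_cm / distance_cm))
```

Applied to the person-independent ME of 3.17 cm at the two ends of
the 30 to 50 cm viewing range, this gives about 6.03 and 3.63
degrees respectively, which agrees with the range listed for this
work in Fig. 13.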

The first work against which we compare our results
is [44], which targets mobile tablets using
a geometry-based gaze estimation approach. Note that
the authors used 9 gaze locations covering only part of
the tablet screen. Moreover, the data was collected with
a fixed user-tablet distance and the result was reported
for a person-dependent study. The second comparative
work [47] is a study of appearance-based gaze estimation
for laptop screens. While the participants in the study
used the laptops freely, the variation of user posture
on laptops is lower than that on tablets. In addition to
eye appearance, head pose was utilized as an input to
the algorithm because the full face is visible in the
image frames at all times. Our tablet usage environments
contrast sharply with the laptop usage environments,
since the face is not entirely visible 69.2% of the time
for a tablet user, which eliminates the possibility
of reliable head pose estimation. Finally, the work in
[19] claimed a state-of-the-art estimation accuracy for
the condition of sparse training samples. In their
experiments, they used fewer than 40 training samples
per person for a person-dependent training scenario.
Their algorithm was evaluated both on data with a fixed
frontal head pose and on data with slight head motion for
8 subjects. However, their method is not easily scalable
to large data problems, and they also focused only on
person-dependent training.

A detailed comparison between our proposed technique
and prior works is shown in Fig. 13. In addition
to quantitative error, other relevant properties, such
as whether a method requires calibration or restricts head
motion, are also listed. We can observe that our method
is the only one that is free of all the constraints.
Moreover, our method achieves an angular error competitive
with other works for both person-dependent and
person-independent training scenarios.

Features \ References              | [17]^1 | [17]^2 | [28]  | [27]  | This Work
For Mobile Device                  | X      | X      | ^3    |       | ✓
Calibration Free                   | X      | X      | X     | X     | ✓
Unconstrained Head Pose            | X      |        |       |       |
Unconstrained User-screen Distance | X      | X      |       |       |
Automatic Preprocessing            |        |        | ^4    |       |
Person-independent                 | N/A    | N/A    | 6.30° | N/A   | 3.63-6.03°
Person-dependent                   | <1°    | 2.37°  | 3.20° | 6.88° | 2.86-4.76°

^1 The head pose is fixed (no head motion).

^2 The head motion is limited to several degrees.

^3 The data is collected for a laptop.

^4 The method requires a mean facial shape model built
across all subjects.

Fig. 13 Comparison with prior works. We can observe that
our method is free of all the constraints. Moreover, our method
achieves better accuracy than Wood et al. [44], and accuracy
comparable to other works, for both person-dependent
and person-independent training.
Fig. 14 Effect of training data size on the gaze estimation
accuracy of TabletGaze. The circles are results obtained
in the experiment, and a line is fitted to the data points. The
triangles are data points derived through extrapolation on the
fitted line. We can see that the ME decreases monotonically
as the number of training subjects grows larger, indicating
that more data could improve the performance further.


5.5 Effect of Training Data Size 


Fig. 15 Diagram of the design for the three experiments
studying the factor of prescription glasses. In Experiment 1,
the dataset was partitioned into 2 groups, wearing glasses
(Group 1, N subjects) and not wearing glasses (Group 2, N
subjects), and training and testing were done separately for
each group. In Experiment 2, leave-one-subject-out cross
validation was conducted on all data (2 x N subjects), but
the ME was reported separately for each group. In
Experiment 3, we combined data of half of the subjects from
Group 1 and half from Group 2, and conducted training and
testing within the combined data.


In this section, we study the impact of training data size
on the estimation accuracy of TabletGaze. We randomly
select groups of K participants for evaluation, where K
is within the range [2, 41]. For each group,
we perform leave-one-subject-out cross validation, so in
each training round we use K - 1 subjects' data. Since
we are randomly selecting a subset of the whole data, we
repeat the same process 5 times and
average the final reported errors to reduce bias.

The results are presented in a semi-log plot as shown
in Fig. 14. As the size of the training group increases,
the estimation error decreases monotonically. This
monotonic decrease suggests that if we use
more training subjects, we can further improve
estimation accuracy.
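
The line fit and extrapolation in a semi-log plot can be sketched
as an ordinary least-squares fit of the ME against the logarithm
of the number of subjects. This is a minimal illustration under
the assumption that the logarithmic axis is the number of training
subjects K; the model ME = a + b * log10(K) and the function names
are hypothetical, and no fitted coefficients from the paper are
reproduced here.

```python
import math


def fit_semilog(num_subjects, mean_errors):
    """Least-squares fit of ME = a + b * log10(K); returns (a, b)."""
    xs = [math.log10(k) for k in num_subjects]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(mean_errors) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, mean_errors))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b


def extrapolate(a, b, k):
    """Predicted ME for k training subjects on the fitted line."""
    return a + b * math.log10(k)
```

A negative slope b corresponds to the monotonic decrease described
above, and extrapolate evaluates the fitted line beyond the largest
measured group size.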


5.6 Eyeglasses, Race and Posture 

We examine whether dividing the dataset into groups
based on person-related factors and training a separate
regressor for each group would further reduce the
estimation error. Our hypothesis is that the eye appearance
variations caused by factors other than gaze can be
reduced within each group. Previous works on head pose
estimation [13, 48] and face detection [17] demonstrated
improved accuracies by dividing the data into groups
and training a regressor/detector for each group. At the
same time, we also examine the impact of each factor
on gaze estimation accuracy. Due to a lack of sufficient
data in some of the categories (for example, we have only
six subjects who are Caucasian and wear glasses), we
could not perform controlled tests to study the impact
of each independent factor. Nevertheless, we can still
gain some initial understanding of the impact of the
three factors on the performance of the gaze estimation
algorithm. Three factors are considered in our study:
eyeglasses (wearing eyeglasses or not), race (Caucasian
or Asian), and body posture (standing, sitting,
slouching or lying). Three experiments are conducted for
each factor.

5.6.1 Eyeglasses 

We first discuss the impact of eyeglasses. A diagram
of the design for the three experiments is
shown in Fig. 15. The dataset is first divided into two
groups: Group 1 is wearing glasses, and Group 2 is not.
In the first experiment, leave-one-subject-out cross
validation is run on the data of each group separately,
and the estimation errors are obtained for each group. In
our data, there is an unequal number of subjects within
each group. To solve this problem, suppose Group 1 has
M subjects and Group 2 has N subjects, where M is
larger than N. Then we randomly select N subjects
from Group 1 and run Experiment 1. We repeat the
experiment 5 times and average the ME for Group 1
to reduce bias caused by random selection. The second
experiment is conducted on data from both groups using
leave-one-subject-out cross validation. The estimation
error is reported separately depending on whether the test
subjects are wearing glasses or not. In Experiment 1, the
number of training subjects is smaller than in
Experiment 2 due to data partitioning. We can infer that
this size discrepancy will have a negative impact on the
estimation accuracy, as discussed in Section 5.5. To
mitigate the effects of training data size, in Experiment 3
we choose the same training data size as in Experiment 1.
We randomly select N/2 subjects from Group 1 and N/2
subjects from Group 2, and combine the data in Experiment 3.
The evaluation is done using the same method
as in Experiment 2. Experiment 3 is also repeated 5
times to reduce the bias caused by the random selection
of training subjects.

Fig. 16 Study of whether partitioning the data based on
person-related factors would reduce estimation error. Three
bar plots (glasses, race, posture) each show the results of
Exp. 1, Exp. 2 and Exp. 3. The error obtained in Exp. 2 is
lower than that in Exp. 1 for all three factors. This means
that when we have limited training subjects, data partition
does not improve the estimation accuracy. The error obtained
in Exp. 3 is higher than that in Exp. 1 for racial
characteristics and body posture, and almost the same for
wearing glasses or not. This suggests that when we have
sufficient training subjects, data partition based on the
factors of race and body posture improves the estimation
accuracy, while the factor of glasses does not significantly
impact the result.


The results are shown in the first bar plot of Fig. 16.
The ME of the group wearing glasses is larger than that
of the group not wearing glasses in all three experiments.
In Experiment 1, the ME is around 0.4 cm higher for the
group wearing glasses than for the group not wearing
glasses. These observations mean that wearing glasses has
a negative impact on gaze estimation accuracy. For each
group, the ME of Experiment 1 is higher than
the ME of Experiment 2. This increase in error means
that partitioning the data does not improve accuracy
when we have a limited number of training subjects. We
obtain similar MEs for Experiments 1 and 3, showing that
partitioning the data based on the factor of glasses does
not have a significant impact on estimation accuracy
when we have sufficient training data. Partitioning by
glasses most likely has little effect because the
reflection from glasses is sometimes not strong and does
not introduce much noise in the eye images.


5.6.2 Race 

We utilize the same approach as in Section 5.6.1 to
design the three experiments to study the impact of
racial characteristics. The second bar plot of Fig. 16
shows the results. We obtain quite different MEs for the
group of Caucasians and the group of Asians, which
indicates that the factor of race impacts the performance
of the gaze estimation algorithm. We also notice that for
each individual group, the ME of Experiment 1 is higher than
the ME of Experiment 2 while the ME of Experiment 1 is
lower than that of Experiment 3. We can infer that
partitioning the data does not improve accuracy when
we have a limited number of training subjects. Moreover,
when we have a large amount of training data, dividing
the data based on race improves accuracy, because people
within the same racial group have similar eye shapes.

5.6.3 Body Posture

For studying the impact of body posture (standing,
sitting, slouching or lying), the data partition is performed
differently. As described in Section 3.1, we have four
subsets of data for each posture for each subject. We
partition the dataset into 4 groups, each group
containing the data of one body posture from the same
subjects. For each subject, the data size for each body
posture may be unequal due to occasional unusable data. To
reduce the effect of unequal dataset size, we choose a
subset of 29 subjects from the TabletGaze dataset where
the amount of each subject's data for each body posture
is almost equal. Then we perform the three experiments
in the same way as described earlier in this section.

The results are shown in the third bar plot of Fig. 16.
We notice that the MEs of the standing, sitting and
slouching groups are quite similar, while the ME for the
lying group is the highest. One reason for the high
error of the lying group is that people have more
varied head poses and ways of holding the tablet when
they are lying. We also notice that for each individual
group, the ME of Experiment 1 is higher than the
ME of Experiment 2 while the ME of Experiment 1 is
lower than that of Experiment 3. We can infer that
partitioning the data does not improve accuracy when
we have a limited number of training subjects. Moreover,
when we have a large amount of training data, dividing
the data based on body posture improves accuracy,
because people might have similar head poses when they
are in the same posture.

5.7 Continuous Gaze Tracking from Videos 

We apply the TabletGaze algorithm to continuously
track the user’s gaze in videos from the sub-dataset.
Initially, we directly estimate the user’s gaze in the
videos on a frame-by-frame basis. When implementing a
continuous gaze tracking system, temporal information can
be utilized to further reduce gaze prediction errors. A
temporal bilateral filter can be applied to consecutive
gaze estimates to reduce the small fluctuations between
neighboring estimates caused by model noise, while
preserving the large gaze shifts due to changes of
fixation location. Some example images of the continuous
gaze tracking based on our TabletGaze algorithm, and the
effect of the bilateral filter, are displayed in Fig. 17.
We can observe that for
each gaze location in the image, the gaze estimations 
are close to the ground truth gaze locations, and the 
errors are less than 3.4 cm (the distance between two
cross-stitches), which conforms to the ME of 3.17 cm. We
also notice that after applying the bilateral filter, the
fluctuations of the gaze estimations for each ground truth
gaze location are decreased. Meanwhile, temporal eye
center location information can be collected, and the 
change of subsequent eye center locations can be used to 
correct gaze estimations. For example, sometimes a user 
naturally moves his/her head from left to right when 
he/she is looking from left to right on the screen. Along 
with the head movement, the eye center location would 
also shift to the right. This shift of eye location can thus 
be utilized to correct neighboring gaze estimates so the 
predicted gaze location also changes accordingly. 
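
To make the filtering step concrete, the following is a minimal sketch of a temporal bilateral filter applied to a sequence of per-frame gaze estimates. The function name, the window size, and the two bandwidths (sigma_t in frames, sigma_s in cm) are illustrative assumptions; the values used in this work are not stated here.

```python
import math

def temporal_bilateral_filter(gaze, half_window=3, sigma_t=2.0, sigma_s=1.0):
    """Smooth a list of (x, y) gaze estimates given in cm.

    Each output is a weighted mean of the estimates in a temporal window.
    A neighbor's weight is the product of a temporal Gaussian (distance in
    frames) and a spatial Gaussian (distance in cm on the screen), so small
    jitter is averaged out, while estimates on the far side of a large gaze
    shift receive almost no weight and the shift is preserved.
    """
    smoothed = []
    for i, (xi, yi) in enumerate(gaze):
        lo = max(0, i - half_window)
        hi = min(len(gaze), i + half_window + 1)
        wsum = xsum = ysum = 0.0
        for j in range(lo, hi):
            xj, yj = gaze[j]
            dt2 = (i - j) ** 2
            ds2 = (xi - xj) ** 2 + (yi - yj) ** 2
            w = math.exp(-dt2 / (2 * sigma_t ** 2)) \
                * math.exp(-ds2 / (2 * sigma_s ** 2))
            wsum += w
            xsum += w * xj
            ysum += w * yj
        # wsum >= 1 because the center sample always has weight 1.
        smoothed.append((xsum / wsum, ysum / wsum))
    return smoothed
```

A plain moving average would blur the transition between two fixation locations; the spatial term is what lets the filter keep that transition sharp.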

6 Discussion and Conclusion 

All of the evaluations of the algorithm are conducted on
a desktop computer. When implementing the algorithm
on a tablet, the RF regressor can be pre-trained off-line
and loaded onto the device. The computation of the
mHoG feature from an image and prediction using the
RF model are fast, which means real-time estimation is
possible. An explicit 3D head pose is not utilized in
this work. Here we discuss an exploratory experiment
regarding incorporating implicit head pose information.

Fig. 17 Continuous gaze tracking demonstration. Each image
shows a part of the tablet screen, and the cross-stitches of
the grid lines represent the 35 potential ground truth gaze
locations. In (a), 3 ground truth gaze locations and the
color-coded predicted locations are shown for a single subject
from our dataset. Each location is shown in the sequence
indicated by the arrows. The distance between the predicted gaze
locations and the true gaze location is within the distance
between two cross-stitches (3.4 cm). In (b), the predicted
locations are passed through a bilateral filter; the fluctuations
of the predictions are reduced by the filter.

As we discussed earlier, direct 3D head pose information
cannot always be obtained for the mobile environment
due to partial facial visibility in some cases. But
head pose information is correlated with features such as 
the location of the eye center in the image frame and the 
size of the eyes, which can be extracted as alternatives 
to exact head pose angles. To utilize this information, 
we design a feature vector composed of the following 
features: the x and y coordinates of the left and right 
eyes, eye sizes (width and height of the eye bounding 
boxes), and the x and y location difference between the 
left and right eyes. This feature vector has a length
of 10 and is combined with the LDA-reduced mHoG
feature as an input to the RF regressor. The data is also
evaluated using leave-one-subject-out cross validation,
and we obtain a ME of 3.10 ± 2.07 cm. There is no
significant improvement compared to the 3.17 ± 2.10
cm ME when we do not use the eye location information.
This means directly adding these features does not
result in significantly improved estimation accuracy. A
future direction could be focused on designing a new 
scheme to appropriately and productively incorporate 
eye location information. 
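
As an illustration of the feature design above, the following sketch assembles the 10-element vector from two eye bounding boxes and appends it to an already reduced mHoG feature. The function names, the (x, y, width, height) box layout, the ordering of the elements, and the sign convention of the left-minus-right difference are assumptions made for this example.

```python
def implicit_head_pose_features(left_eye, right_eye):
    """Build the 10-element implicit head pose feature vector.

    left_eye, right_eye: (x, y, width, height) in image pixels, where
    (x, y) is the eye location and (width, height) is the size of the
    eye bounding box.
    """
    lx, ly, lw, lh = left_eye
    rx, ry, rw, rh = right_eye
    return [
        lx, ly, rx, ry,    # x and y coordinates of both eyes
        lw, lh, rw, rh,    # eye sizes (bounding box width and height)
        lx - rx, ly - ry,  # x and y location difference between the eyes
    ]

def combined_feature(mhog_reduced, left_eye, right_eye):
    """Concatenate the reduced mHoG feature with the 10 extra features."""
    return list(mhog_reduced) + implicit_head_pose_features(left_eye, right_eye)
```

Because the extra elements are raw pixel quantities while the mHoG part is a reduced appearance feature, the two parts can differ widely in scale; simple concatenation as shown here does not account for that.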

In conclusion, this work presented and studied the
unconstrained mobile gaze estimation problem in three
major steps. First, a large dataset was collected in an
unconstrained environment. To the best of our knowledge,
this is the first dataset of its kind. The dataset
is designed to explore the variation of subject appearances
in an unconstrained environment by including 4
different postures and recording the data in videos.
Second, three observations were made on the dataset,
concerning facial visibility, posture, and glasses
reflection, which provide a deeper understanding of the
challenges present in the mobile environment. Third, an
automatic gaze estimation algorithm was presented,
composed of currently available computer vision tools,
which can estimate a person’s
gaze from an image recorded by the tablet front camera.
The algorithm achieves a ME of 3.17 ± 2.10 cm on the 
tablet screen, which is a significant improvement over 
prior works on mobile gaze estimation. The result is
adequate for tablet applications that do not require
high accuracy.